0.1 IntentΒΆ
The purpose of this project is to conduct an exploratory data analysis (EDA) of the Supermarket EDA Sales dataset in order to identify the relevant factors that affect gross margin percentage and gross income. The dataset includes information on invoice ID, branch, city, customer type, gender, product line, unit price, quantity, tax, total, date, time, payment, cogs, gross margin percentage, gross income, and rating.
The EDA will be conducted using a variety of statistical and graphical techniques, including distplot, histplot, countplot, regplot, boxplot, lineplot, and heatmap. The goal of the EDA is to gain a deeper understanding of the data and identify the factors that have the strongest influence on gross margin percentage and gross income.
The insights gained from the EDA will be used to inform future marketing and sales strategies. The goal is to identify opportunities to increase gross margin percentage and gross income by targeting specific customer segments, product lines, or marketing channels.
0.1.1 Specific ObjectivesΒΆ
The specific objectives of this project are to:
Explore the distribution of the data and identify any outliers. Identify the relationships between the different variables. Determine the factors that have the strongest influence on gross margin percentage and gross income. Generate insights that can be used to improve marketing and sales strategies. Expected Outcomes
The expected outcomes of this project are to:
Gain a deeper understanding of the Supermarket EDA Sales dataset. Identify the factors that have the strongest influence on gross margin percentage and gross income. Generate insights that can be used to improve marketing and sales strategies.
0.1.2 ContextΒΆ
The growth of supermarkets in most populated cities are increasing and market competitions are also high. The dataset is one of the historical sales of supermarket company which has recorded in 3 different branches for 3 months data.
Data Dictionary
Invoice id: Computer generated sales slip invoice identification number
Branch: Branch of supercenter (3 branches are available identified by A, B and C).
City: Location of supercenters
Customer type: Type of customers, recorded by Members for customers using member card and Normal for without member card.
Gender: Gender type of customer
Product line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
Unit price: Price of each product in USD
Quantity: Number of products purchased by customer
Tax: 5% tax fee for customer buying
Total: Total price including tax
Date: Date of purchase (Record available from January 2019 to March 2019)
Time: Purchase time (10am to 9pm)
Payment: Payment used by customer for purchase (3 methods are available β Cash, Credit card and Ewallet)
COGS: Cost of goods sold
Gross margin percentage: Gross margin percentage
Gross income: Gross income
Rating: Customer stratification rating on their overall shopping experience (On a scale of 1 to 10)
1.0 Import the necessary libraries.ΒΆ
!pip install --upgrade pip
!pip install --upgrade Pillow
Requirement already satisfied: pip in /opt/conda/lib/python3.10/site-packages (23.1.2)
Collecting pip
Downloading pip-23.2.1-py3-none-any.whl (2.1 MB)
ββββββββββββββββββββββββββββββββββββββββ 2.1/2.1 MB 48.1 MB/s eta 0:00:00
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 23.1.2
Uninstalling pip-23.1.2:
Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2.1
Requirement already satisfied: Pillow in /opt/conda/lib/python3.10/site-packages (9.5.0)
Collecting Pillow
Obtaining dependency information for Pillow from https://files.pythonhosted.org/packages/3d/36/e78f09d510354977e10102dd811e928666021d9c451e05df962d56477772/Pillow-10.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata
Downloading Pillow-10.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.5 kB)
Downloading Pillow-10.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (3.4 MB)
ββββββββββββββββββββββββββββββββββββββββ 3.4/3.4 MB 58.4 MB/s eta 0:00:00
Installing collected packages: Pillow
Attempting uninstall: Pillow
Found existing installation: Pillow 9.5.0
Uninstalling Pillow-9.5.0:
Successfully uninstalled Pillow-9.5.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ydata-profiling 4.3.1 requires scipy<1.11,>=1.4.1, but you have scipy 1.11.1 which is incompatible.
Successfully installed Pillow-10.0.0
!pip install calmap
!pip install ydata_profiling
Collecting calmap
Obtaining dependency information for calmap from https://files.pythonhosted.org/packages/94/47/4ca1e36da124e0fe17d27937f35809df1841fb8e2977c61ca00df60c2429/calmap-0.0.11-py2.py3-none-any.whl.metadata
Downloading calmap-0.0.11-py2.py3-none-any.whl.metadata (2.2 kB)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.10/site-packages (from calmap) (3.7.1)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from calmap) (1.23.5)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from calmap) (1.5.3)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (4.40.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (21.3)
Requirement already satisfied: pillow>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (10.0.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib->calmap) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->calmap) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->calmap) (1.16.0)
Downloading calmap-0.0.11-py2.py3-none-any.whl (7.3 kB)
Installing collected packages: calmap
Successfully installed calmap-0.0.11
Requirement already satisfied: ydata_profiling in /opt/conda/lib/python3.10/site-packages (4.3.1)
Collecting scipy<1.11,>=1.4.1 (from ydata_profiling)
Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
ββββββββββββββββββββββββββββββββββββββββ 34.4/34.4 MB 43.6 MB/s eta 0:00:00
Requirement already satisfied: pandas!=1.4.0,<2.1,>1.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.5.3)
Requirement already satisfied: matplotlib<4,>=3.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (3.7.1)
Requirement already satisfied: pydantic<2,>=1.8.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.10.9)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (6.0)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (3.1.2)
Requirement already satisfied: visions[type_image_path]==0.7.5 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.7.5)
Requirement already satisfied: numpy<1.24,>=1.16.0 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.23.5)
Requirement already satisfied: htmlmin==0.1.12 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.1.12)
Requirement already satisfied: phik<0.13,>=0.11.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.12.3)
Requirement already satisfied: requests<3,>=2.24.0 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (2.31.0)
Requirement already satisfied: tqdm<5,>=4.48.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (4.65.0)
Requirement already satisfied: seaborn<0.13,>=0.10.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.12.2)
Requirement already satisfied: multimethod<2,>=1.4 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.9.1)
Requirement already satisfied: statsmodels<1,>=0.13.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (0.14.0)
Requirement already satisfied: typeguard<3,>=2.13.2 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (2.13.3)
Requirement already satisfied: imagehash==4.3.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (4.3.1)
Requirement already satisfied: wordcloud>=1.9.1 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.9.2)
Requirement already satisfied: dacite>=1.8 in /opt/conda/lib/python3.10/site-packages (from ydata_profiling) (1.8.1)
Requirement already satisfied: PyWavelets in /opt/conda/lib/python3.10/site-packages (from imagehash==4.3.1->ydata_profiling) (1.4.1)
Requirement already satisfied: pillow in /opt/conda/lib/python3.10/site-packages (from imagehash==4.3.1->ydata_profiling) (10.0.0)
Requirement already satisfied: attrs>=19.3.0 in /opt/conda/lib/python3.10/site-packages (from visions[type_image_path]==0.7.5->ydata_profiling) (23.1.0)
Requirement already satisfied: networkx>=2.4 in /opt/conda/lib/python3.10/site-packages (from visions[type_image_path]==0.7.5->ydata_profiling) (3.1)
Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in /opt/conda/lib/python3.10/site-packages (from visions[type_image_path]==0.7.5->ydata_profiling) (0.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.10/site-packages (from jinja2<3.2,>=2.11.1->ydata_profiling) (2.1.3)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (4.40.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (21.3)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.10/site-packages (from matplotlib<4,>=3.2->ydata_profiling) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas!=1.4.0,<2.1,>1.1->ydata_profiling) (2023.3)
Requirement already satisfied: joblib>=0.14.1 in /opt/conda/lib/python3.10/site-packages (from phik<0.13,>=0.11.1->ydata_profiling) (1.2.0)
Requirement already satisfied: typing-extensions>=4.2.0 in /opt/conda/lib/python3.10/site-packages (from pydantic<2,>=1.8.1->ydata_profiling) (4.6.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.10/site-packages (from requests<3,>=2.24.0->ydata_profiling) (2023.5.7)
Requirement already satisfied: patsy>=0.5.2 in /opt/conda/lib/python3.10/site-packages (from statsmodels<1,>=0.13.2->ydata_profiling) (0.5.3)
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from patsy>=0.5.2->statsmodels<1,>=0.13.2->ydata_profiling) (1.16.0)
Installing collected packages: scipy
Attempting uninstall: scipy
Found existing installation: scipy 1.11.1
Uninstalling scipy-1.11.1:
Successfully uninstalled scipy-1.11.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
momepy 0.6.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
pymc3 3.11.5 requires numpy<1.22.2,>=1.15.0, but you have numpy 1.23.5 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you have scipy 1.10.1 which is incompatible.
Successfully installed scipy-1.10.1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calmap
#from pandas_profiling import ProfileReport
from ydata_profiling import ProfileReport
# To ignore warinings
import warnings
warnings.filterwarnings('ignore')
<frozen importlib._bootstrap>:241: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject /opt/conda/lib/python3.10/site-packages/numba/core/decorators.py:262: NumbaDeprecationWarning: numba.generated_jit is deprecated. Please see the documentation at: https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-generated-jit for more information and advice on a suitable replacement. warnings.warn(msg, NumbaDeprecationWarning) /opt/conda/lib/python3.10/site-packages/visions/backends/shared/nan_handling.py:51: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. def hasna(x: np.ndarray) -> bool:
1.1 Load DatasetΒΆ
df = pd.read_csv('/kaggle/input/supermarket-eda-sales/supermarket_sales.csv')
2.0 Convert the Date column to datetime format.ΒΆ
df['Date'] = pd.to_datetime(df['Date'])
df['Date']
0 2019-01-05
1 2019-03-08
2 2019-03-03
3 2019-01-27
4 2019-02-08
...
998 2019-02-22
999 2019-02-18
1000 2019-02-18
1001 2019-03-10
1002 2019-01-26
Name: Date, Length: 1003, dtype: datetime64[ns]
2.1 Set the Date column as the index.ΒΆ
df = df.set_index('Date')
3.0 Use seaborn to show the distribution of customer ratings.ΒΆ
sns.distplot(df['Rating'])
plt.axvline(df['Rating'].mean(), color='red',label='mean')
plt.axvline(df['Rating'].quantile(0.25), color='orange',label='25-75th percentile')
plt.axvline(df['Rating'].quantile(0.75), color='orange')
plt.legend()
<matplotlib.legend.Legend at 0x787ccb568a30>
3.1 Results:ΒΆ
The plot shows that the distribution is positively skewed, meaning that there are more ratings towards the lower end of the scale. The mean rating is near 7.5, and the 25th and 75th percentiles are 5.5 and 8.4, respectively. This means that 25% of the ratings are below 5.5, and 75% of the ratings are below 8.4. The 95th percentile is 9.1, so 95% of the ratings are below 9.1.
The following are some key takeaways from the visual:
There is a slight dip in the distribution between the 25th and 75th percentiles. The 95th percentile is relatively high, suggesting that there are a few outliers with very high ratings. These observations suggest that the Rating column is generally well-distributed, with a few outliers at the high end.
sns.countplot(x=df['Branch'])
<Axes: xlabel='Branch', ylabel='count'>
df['Branch'].value_counts()
A 342 B 333 C 328 Name: Branch, dtype: int64
sns.boxplot(x=df['Branch'],y=df['gross income'])
<Axes: xlabel='Branch', ylabel='gross income'>
4.1 Results:ΒΆ
- The distribution of sales numbers is fairly uniform across the three branches.
- There is no clear pattern of which branch has the most sales.
- The difference in sales numbers between the branches is not statistically significant.
5.0 Do aggregate sales numbers differ by much between payment methods?ΒΆ
sns.countplot(x=df['Payment'])
<Axes: xlabel='Payment', ylabel='count'>
df['Payment'].value_counts()
Ewallet 346 Cash 346 Credit card 311 Name: Payment, dtype: int64
5.1 Results:ΒΆ
The distribution of sales numbers is fairly uniform across the three payment methods. There is no clear pattern of which payment method has the most sales.
Overall, the results suggest that there is no significant difference in aggregate sales numbers between the payment methods.
E-wallet is the most popular payment method, followed by cash and credit card. The difference in sales numbers between the payment methods is not statistically significant, suggesting that they are all equally popular.
6.0 Using seaborn regplot to demonstrate the relationship between gross income and customer ratings.ΒΆ
sns.regplot(x='Rating', y='gross income', data=df)
<Axes: xlabel='Rating', ylabel='gross income'>
6.1 Results:ΒΆ
The scatter plot shows a negative correlation between rating and gross income. This means that as rating increases, gross income decreases. The correlation is not perfect, however, as there are some outliers. For example, there are a few points with high rating but low gross income, and vice versa.
The following are some possible explanations for the positive correlation between rating and gross income:
Overall gross income stayed below 10 and ratings had no real effect on the amoount of gross income the store would generate.
7.0 Does gender affect gross income?ΒΆ
sns.boxplot(x=df['Gender'],y=df['gross income'])
<Axes: xlabel='Gender', ylabel='gross income'>
7.1 ResultsΒΆ
Gender affects gross income just in the slightest, females spend about 1-3 more than men which is not enough to support that gender affects gross income.
8.0 Does the store have a higher gross income depending on the time of year it is?ΒΆ
8.1 Group the data by the Date column and then run the .mean() operation.ΒΆ
df_grouped = df.groupby('Date').mean()
df_grouped
| Unit price | Quantity | Tax 5% | Total | cogs | gross margin percentage | gross income | Rating | |
|---|---|---|---|---|---|---|---|---|
| Date | ||||||||
| 2019-01-01 | 54.995833 | 6.454545 | 18.830083 | 395.431750 | 376.601667 | 4.761905 | 18.830083 | 6.583333 |
| 2019-01-02 | 44.635000 | 6.000000 | 11.580375 | 243.187875 | 231.607500 | 4.761905 | 11.580375 | 6.050000 |
| 2019-01-03 | 59.457500 | 4.625000 | 12.369813 | 259.766062 | 247.396250 | 4.761905 | 12.369813 | 8.112500 |
| 2019-01-04 | 51.743333 | 5.333333 | 12.886417 | 270.614750 | 257.728333 | 4.761905 | 12.886417 | 6.516667 |
| 2019-01-05 | 61.636667 | 4.583333 | 14.034458 | 294.723625 | 280.689167 | 4.761905 | 14.034458 | 7.433333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019-03-26 | 42.972308 | 4.000000 | 7.188692 | 150.962538 | 143.773846 | 4.761905 | 7.188692 | 6.623077 |
| 2019-03-27 | 56.841000 | 4.500000 | 13.822950 | 290.281950 | 276.459000 | 4.761905 | 13.822950 | 6.760000 |
| 2019-03-28 | 45.525000 | 4.800000 | 10.616200 | 222.940200 | 212.324000 | 4.761905 | 10.616200 | 7.050000 |
| 2019-03-29 | 66.346250 | 6.750000 | 23.947875 | 502.905375 | 478.957500 | 4.761905 | 23.947875 | 6.925000 |
| 2019-03-30 | 67.408182 | 5.888889 | 19.424500 | 407.914500 | 388.490000 | 4.761905 | 19.424500 | 6.800000 |
89 rows Γ 8 columns
8.2 Using the new df_grouped dataset plot a lineplot to demonstrate the correlation between date and gross income.ΒΆ
sns.lineplot(x=df_grouped.index,
y=df_grouped['gross income'])
<Axes: xlabel='Date', ylabel='gross income'>
8.3 Results:ΒΆ
The line plot shows that gross income was relatively low in March. The highest gross income was recorded on February 11, 2019.
The trend of gross income is generally decreasing, but there are some fluctuations. There is a slight peak in gross income around February 10-13, 2019 with an immediate dip to record lows just days later. The lowest gross income was recorded on February 15, 2019. Overall, the line plot shows that gross income from supermarket sales has been decreasing over time. This suggests that the supermarket is doing poor financially.
9.0 Clean duplicated rows and missing valuesΒΆ
9.1 Count the number of duplicate rows in the df DataFrame.ΒΆ
df.duplicated().sum()
3
9.2 Select and print the duplicate rows in the df DataFrame.ΒΆ
df[df.duplicated()==True]
| Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Time | Payment | cogs | gross margin percentage | gross income | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||||||
| 2019-02-18 | 849-09-3807 | A | Yangon | Member | Female | Fashion accessories | 88.34 | 7.0 | 30.919 | 649.299 | 13:28 | Cash | 618.38 | 4.761905 | 30.919 | 6.6 |
| 2019-03-10 | 745-74-0715 | A | Yangon | Normal | Male | Electronic accessories | NaN | 2.0 | 5.803 | 121.863 | 20:46 | Ewallet | 116.06 | 4.761905 | 5.803 | 8.8 |
| 2019-01-26 | 452-04-8808 | B | Mandalay | Normal | Male | Electronic accessories | 87.08 | NaN | 30.478 | 640.038 | 15:17 | Cash | 609.56 | 4.761905 | 30.478 | 5.5 |
9.3 Drop the duplicate rows in the df DataFrame.ΒΆ
df.drop_duplicates(inplace=True)
9.4 Calculate the proportion of missing values in the df DataFrame.ΒΆ
df.isna().sum()/len(df)
Invoice ID 0.000 Branch 0.000 City 0.000 Customer type 0.079 Gender 0.000 Product line 0.043 Unit price 0.006 Quantity 0.019 Tax 5% 0.000 Total 0.000 Time 0.000 Payment 0.000 cogs 0.000 gross margin percentage 0.000 gross income 0.000 Rating 0.000 dtype: float64
9.5 Create a heatmap of the missing values in the df DataFrame.ΒΆ
sns.heatmap(df.isnull(),cbar=False)
<Axes: ylabel='Date'>
9.6 Fill all missing values in the df DataFrame with the average value in each column.ΒΆ
df.fillna(df.mean(),inplace=True)
9.7 Fill all missing values in the df DataFrame with the most frequent value in each column.ΒΆ
df.fillna(df.mode().iloc[0],inplace=True)
10.0 Generate a Profile Report for a quick and easy view of duplicatesm missing values, correlations, and more.ΒΆ
dataset = pd.read_csv('/kaggle/input/supermarket-eda-sales/supermarket_sales.csv')
profile = ProfileReport(dataset, title="Pandas Profiling Report")
profile.to_notebook_iframe()
11.0 Correlation AnalysisΒΆ
11.1 Calculate the correlation coefficient between the gross income and rating columns.ΒΆ
round(np.corrcoef(df['gross income'],df['Rating'])[1][0],2)
-0.04
11.2 ResultsΒΆ
-0.04, indicates a very weak negative correlation between gross income and rating. This means that as gross income increases, rating tends to decrease slightly. However, the correlation is very weak, so it is not clear that there is a causal relationship between the two variables.
11.3 Correlation matrixΒΆ
np.round(df.corr(),2)
| Unit price | Quantity | Tax 5% | Total | cogs | gross margin percentage | gross income | Rating | |
|---|---|---|---|---|---|---|---|---|
| Unit price | 1.00 | 0.01 | 0.63 | 0.63 | 0.63 | NaN | 0.63 | -0.01 |
| Quantity | 0.01 | 1.00 | 0.70 | 0.70 | 0.70 | NaN | 0.70 | -0.02 |
| Tax 5% | 0.63 | 0.70 | 1.00 | 1.00 | 1.00 | NaN | 1.00 | -0.04 |
| Total | 0.63 | 0.70 | 1.00 | 1.00 | 1.00 | NaN | 1.00 | -0.04 |
| cogs | 0.63 | 0.70 | 1.00 | 1.00 | 1.00 | NaN | 1.00 | -0.04 |
| gross margin percentage | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| gross income | 0.63 | 0.70 | 1.00 | 1.00 | 1.00 | NaN | 1.00 | -0.04 |
| Rating | -0.01 | -0.02 | -0.04 | -0.04 | -0.04 | NaN | -0.04 | 1.00 |
11.4 ResultsΒΆ
There is a strong positive correlation between unit price and cogs (0.63), between quantity and gross income (0.70), and between tax 5% and total (1.00). This means that as one variable increases, the other variable tends to increase as well.
There are also a number of weak correlations between the variables in the data set. For example, there is a weak positive correlation between rating and gross income (0.01), and a weak negative correlation between rating and cogs (-0.01). This means that as one variable increases, the other variable tends to increase or decrease slightly.
11.5 Heatmap to show correlation between the different columns in the data.ΒΆ
sns.heatmap(np.round(df.corr(),2),annot=True)
<Axes: >
11.6 ResultsΒΆ
- The correlation between unit price and total is not surprising, as these two variables are directly related.
- The correlation between quantity and total is also not surprising, as these two variables are also directly related.
- The correlation between tax 5% and total is weak, but it is still significant. This suggests that tax 5% has a small but noticeable impact on total sales.
- The correlation between cogs and gross margin percentage is weak, but it is negative. This suggests that as cogs increases, gross margin percentage decreases. This is because cogs is a cost, so as cogs increases, there is less money left over for gross income.
- The correlation between gross margin percentage and gross income is weak, but it is positive. This suggests that as gross margin percentage increases, gross income also increases. This is because gross margin percentage is a measure of profitability, so as gross margin percentage increases, there is more money left over for gross income.